
# Spark Jupyter getting started docker compose #295

**Merged.** 16 commits merged into apache:main on Oct 15, 2024.

## Conversation

**@kevinjqliu** (Contributor) commented Sep 15, 2024

## Description

This PR moves the `docker-compose-jupyter.yml` file (and the `notebooks/` directory), formerly in the top-level directory, into the `getting-started/spark/` folder.

The purpose is to unify the "getting started" guides under the same directory.

Fixes #110

## Type of change

Please delete options that are not relevant.

* Bug fix (non-breaking change which fixes an issue)
* Documentation update
* New feature (non-breaking change which adds functionality)
* Breaking change (fix or feature that would cause existing functionality to not work as expected)
* This change requires a documentation update

## How Has This Been Tested?

```sh
docker-compose -f getting-started/spark/docker-compose.yml up
```

1. Open the `SparkPolaris.ipynb` Jupyter notebook.
2. Grab the root principal credentials from the Polaris service and substitute them into the notebook cell.
3. Run all cells in the notebook.
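For step 2, one possible way to pull the credentials out of the running service, assuming the Polaris container prints its root principal credentials to its startup logs (the exact log line and service name are assumptions, not taken from this PR):

```sh
# Hypothetical sketch: grep the Polaris service logs for the bootstrap
# credentials. The service name and log text may differ between versions.
docker-compose -f getting-started/spark/docker-compose.yml logs polaris \
  | grep "root principal credentials"
```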

## Checklist

Please delete options that are not relevant.

* I have performed a self-review of my code
* I have commented my code, particularly in hard-to-understand areas
* I have made corresponding changes to the documentation
* My changes generate no new warnings
* If adding new functionality, I have discussed my implementation with the community using the linked GitHub issue

**@kevinjqliu** (Contributor, Author)

I want to make sure this is something we want to do before proceeding to add more to the PR.

cc @collado-mike / @flyrain

**@flyrain** (Contributor) commented Sep 16, 2024

Makes sense to me. Thanks @kevinjqliu! Do we have any docs for its usage? We may want to add some if not.

**@kevinjqliu** (Contributor, Author)

@flyrain Yep, I'll add a README here, similar to the Trino one.

**@flyrain** (Contributor) commented Sep 16, 2024

Sounds good. We will need these docs on the Polaris doc site, like https://polaris.apache.org/docs/overview/. I couldn't find Trino's doc there, so this may involve doc publishing and linking. cc @jbonofre

**@kevinjqliu** (Contributor, Author)

I see, this is the README for Trino. I'll add a similar README for Spark.

As a follow-up, we can change the Polaris doc to refer to these guides: https://polaris.apache.org/docs/quickstart

**@collado-mike** (Contributor)

This looks good to me. We should rename the compose file to just `docker-compose.yml` so we don't have to specify the filename on the command line :)

**@kevinjqliu** (Contributor, Author)

@collado-mike makes sense, will do.

I have a question on Slack about being unable to assume the role `arn:aws:iam::631484165566:role/datalake-storage-integration-role` in the notebook; do you mind taking a look?

**@kevinjqliu** (Contributor, Author)

r? @flyrain @RussellSpitzer @collado-mike

Also opened #319 to update the Polaris doc site once this is merged.

*Review threads on `getting-started/spark/README.md` (outdated, resolved):*
**Contributor** (review comment)

I'm a bit conflicted about this doc. It feels like it doesn't really teach the reader anything about Polaris, although it does give you a really fast way to get bootstrapped.

**@kevinjqliu** (Author)

Yes, I'll admit this README is filler for now, a way to get Spark & Polaris up and running quickly.

**Contributor** (review comment)

I wonder if it might be easier to use the CLI here.

**@kevinjqliu** (Author)

Might be, if you want a spark-shell. I think the Jupyter notebook does a good job of explaining a lot of the concepts.

**Contributor** (review comment)

Sorry, I meant the Polaris CLI instead of using curl.

**@kevinjqliu** (Author)

Ah, I don't know how to use the Polaris CLI, so I just copied directly from https://github.com/apache/polaris/blob/main/regtests/run_spark_sql.sh

**@kevinjqliu** (Contributor, Author)

The md-link check intermittently shows https://redocly.com/docs/cli/installation as a 400 error; weird.

**@flyrain** (Contributor) commented Oct 1, 2024

> md check intermittently shows https://redocly.com/docs/cli/installation as 400 error, weird

It's OK to remove the link for now since we're transitioning to Hugo.

**@kevinjqliu** (Contributor, Author)

@flyrain I just had to run the CI a few times; it's unrelated to this change.

@kevinjqliu force-pushed the kevinjqliu/getting-started-spark branch from b75d998 to 3eda72b on October 3, 2024 19:59
@kevinjqliu force-pushed the kevinjqliu/getting-started-spark branch from 3eda72b to 48a9f00 on October 15, 2024 14:58
```diff
@@ -41,5 +41,5 @@ jobs:
     with:
       use-quiet-mode: 'yes'
       config-file: '.github/workflows/check-md-link-config.json'
       folder-path: 'regtests, regtests/client/python/docs, regtests/client/python, .github, build-logic, polaris-core, polaris-service, extension, spec, k8, notebooks'
```
**@kevinjqliu** (Author, review comment)

This PR moved `notebooks/` from the top-level directory into the `getting-started/` directory.

**@flyrain** (Contributor) reviewed

Thanks @kevinjqliu for working on it. LGTM overall. Left some comments and questions.

Quoted from the README under review:

> This will spin up 3 container services
> * The `polaris` service for running Apache Polaris

**Contributor** (review comment)

Nit: could we be more explicit that it starts with an in-memory metastore?

> This will spin up 3 container services
> * The `polaris` service for running Apache Polaris
> * The `jupyter` service for running Jupyter notebook with PySpark
> * The `create-polaris-catalog` service to run setup script and create local catalog in Polaris

**Contributor** (review comment)

local catalog -> a catalog backed by the local file system?
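For orientation, a minimal sketch of what a compose file with these three services might look like. Every image tag, port, and volume below is an assumption for illustration, not the file added by this PR:

```yaml
# Hypothetical sketch only -- not the compose file from this PR.
services:
  polaris:
    image: apache/polaris:latest        # assumed image name and tag
    ports:
      - "8181:8181"                     # assumed Polaris REST API port

  jupyter:
    image: jupyter/pyspark-notebook     # assumed PySpark notebook image
    ports:
      - "8888:8888"
    depends_on:
      - polaris
    volumes:
      - ./notebooks:/home/jovyan/notebooks   # assumed notebook mount

  create-polaris-catalog:
    image: curlimages/curl              # assumed one-shot setup container
    depends_on:
      - polaris
    volumes:
      - ./create-polaris-catalog.sh:/create-polaris-catalog.sh
    entrypoint: ["sh", "/create-polaris-catalog.sh"]
```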

Quoted from `getting-started/spark/create-polaris-catalog.sh` under review:

```sh
SPARK_BEARER_TOKEN="${REGTEST_ROOT_BEARER_TOKEN:-principal:root;realm:default-realm}"
POLARIS_CATALOG_NAME="${POLARIS_CATALOG_NAME:-polaris_demo}"

# create a catalog backed by the local filesystem
```

**@flyrain** (review comment)

I'm not entirely sure we need this file. Could we handle everything directly within the notebook, like the other operations in SparkPolaris.ipynb? Would it simplify things if we moved the operations there?

**@kevinjqliu** (Author)

We could, but I think it's a good idea to separate infra code (this script) from application code (the notebook).

**@flyrain** (review comment)

We could initialize the catalog in the notebook as well; I feel it's more flexible that way. For example, you don't have to worry about an env variable for the catalog name. But I'm OK with either one. Not a blocker for me.
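The truncated `# create a catalog backed by the local filesystem` step in the script excerpt above presumably calls the Polaris management API. A hedged sketch of what that request might look like, based on the public Polaris quickstart rather than this PR's actual script (endpoint, payload shape, and base location are assumptions):

```sh
# Hypothetical sketch; the real create-polaris-catalog.sh may differ.
# Assumes the Polaris management API is reachable at localhost:8181.
curl -s -X POST "http://localhost:8181/api/management/v1/catalogs" \
  -H "Authorization: Bearer ${SPARK_BEARER_TOKEN}" \
  -H "Content-Type: application/json" \
  --data @- <<EOF
{
  "name": "${POLARIS_CATALOG_NAME}",
  "type": "INTERNAL",
  "properties": { "default-base-location": "file:///tmp/polaris" },
  "storageConfigInfo": { "storageType": "FILE", "allowedLocations": ["file:///tmp"] }
}
EOF
```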


Quoted from the README under review:

> # Getting Started with Apache Spark and Apache Polaris
>
> This getting started guide provides a `docker-compose` file to set up [Apache Spark](https://spark.apache.org/) with Apache Polaris. Apache Polaris is configured as an Iceberg REST Catalog in Spark.

**Contributor** (review comment)

There are other ways to try Spark with Polaris without Docker; it's not a blocker, we can add it later.
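Since the README says Polaris is configured as an Iceberg REST catalog in Spark, here is a hedged sketch of what that wiring typically looks like with `spark-sql`. The Iceberg runtime version, catalog name, credential, and port are illustrative assumptions, not values from this PR's notebook:

```sh
# Hypothetical sketch of pointing Spark at Polaris as an Iceberg REST catalog.
bin/spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.0 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.polaris.type=rest \
  --conf spark.sql.catalog.polaris.uri=http://localhost:8181/api/catalog \
  --conf spark.sql.catalog.polaris.credential='<client-id>:<client-secret>' \
  --conf spark.sql.catalog.polaris.scope=PRINCIPAL_ROLE:ALL
```

In the PySpark notebook, the same settings would map onto `SparkSession.builder.config(...)` calls.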

**@kevinjqliu** (Contributor, Author)

Thanks for the review @flyrain, addressed your comments.

**@flyrain** (Contributor) commented Oct 15, 2024

We cannot merge any PR until #374 is merged.

**@kevinjqliu** (Contributor, Author)

Thanks for the heads-up, I'll rebase once that PR is merged.

@kevinjqliu force-pushed the kevinjqliu/getting-started-spark branch from a0f6c9a to 797fabb on October 15, 2024 21:17
**@kevinjqliu** (Contributor, Author)

@flyrain Took your advice and moved the `getting-started/spark/create-polaris-catalog.sh` logic into the Jupyter notebook. Also rebased off the latest main. I think this PR is good to go. Please take a look!

@flyrain merged commit ac01e2d into apache:main on Oct 15, 2024. 5 checks passed.
**@flyrain** (Contributor) commented Oct 15, 2024

Thanks a lot for working on it, @kevinjqliu! Thanks all for the review.

@kevinjqliu deleted the kevinjqliu/getting-started-spark branch on October 15, 2024 21:57
Successfully merging this pull request may close these issues:

* [FEATURE REQUEST] Quick Start: Spark via docker